Model Selection and Exploration

Now that the indicators have been cleaned of the problems described previously, we will focus on selecting a good model for text similarity.

When working with NLP and text similarity there are several ways to approach the problem, one of the most widely used being vectorizers. These algorithms count the words present in a sentence and check whether they also appear in the sentence being compared, so the more words two texts share, the more similar they are considered. The best-known method is the TF-IDF Vectorizer. It improves on the basic Count Vectorizer because it weights each word by how common it is across all documents, which makes the result much less biased towards frequent words.
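To make the word-overlap idea above concrete, here is a minimal, dependency-free sketch of TF-IDF weighting followed by cosine similarity. It is not the notebook's code: the helper names `tfidf_vectors` and `cosine`, the whitespace tokenizer and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (smoothed IDF, assumed)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["water use", "water consumption", "access to public transport"]
vecs = tfidf_vectors(docs)
sim_shared = cosine(vecs[0], vecs[1])    # the two texts share the word "water"
sim_disjoint = cosine(vecs[0], vecs[2])  # the two texts have no word in common
```

Note that "water use" and "water consumption" only receive a partial score because a single word overlaps; a word-based method has no way of knowing that "use" and "consumption" are close in meaning.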

Even though these algorithms are decent for simple tasks, in this case we have a large amount of text from formal documents, including synonyms and very specific words that are rarely repeated. We will therefore not use them and will instead apply pre-trained transformers (sentence embeddings) from HuggingFace. Specifically, we will try the following three models (selected by the performance ranking that can be found here):

  • all-mpnet-base-v2 - Best performance overall, not very heavy.

  • gtr-t5-xxl - Huge model trained with 2B+ question-answer pairs, slightly better overall performance than sentence-t5-xxl.

  • all-roberta-large-v1 - Third best performing sentence embedding, not as heavy as gtr-t5-xxl.

Unfortunately, the gtr-t5-xxl model is too big to load into our RAM, so we will replace it with the (apparently deprecated) stsb-roberta-large.
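As a rough sketch of how the three remaining candidates could be loaded and applied, the snippet below uses the `sentence-transformers` package. The Hub ids (prefixed with `sentence-transformers/`), the helper name `encode_indicators` and the `indicators` list in the usage comment are assumptions, not taken from the notebook.

```python
# Hub ids are ASSUMED to live under the "sentence-transformers" organisation.
CANDIDATE_MODELS = [
    "sentence-transformers/all-mpnet-base-v2",
    "sentence-transformers/all-roberta-large-v1",
    "sentence-transformers/stsb-roberta-large",
]

def encode_indicators(indicators, model_name):
    """Encode a list of indicator strings into one embedding vector per indicator."""
    # Imported lazily so this file can be loaded without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloads the weights on first use
    return model.encode(indicators, show_progress_bar=False)

# Usage (requires `pip install sentence-transformers` and network access);
# `indicators` is a hypothetical list of cleaned indicator names:
# embeddings = {name: encode_indicators(indicators, name) for name in CANDIDATE_MODELS}
```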

Model Selection

As we do not have pairs of sentences to train or evaluate the model, we will encode the indicators and split them into two clusters, interpreting one as general and the other as cultural, to get an idea of which model performs best. This is not an indicator analysis, but rather a check of whether the following models can handle our data.
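The evaluation idea above (two clusters scored against the known general/cultural labels) can be sketched as follows. Cluster ids are arbitrary, so the sketch tries both ways of naming them and keeps the better one. The function name, the toy data and the choice of computing the F1-score for the cultural class are assumptions made for illustration.

```python
def evaluate_two_clusters(cluster_ids, true_labels, positive="cultural", negative="general"):
    """Score a 2-cluster assignment against known labels.

    Both mappings (cluster 0 = positive, cluster 1 = positive) are tried and
    the one with the higher accuracy is returned.
    """
    best = None
    for positive_cluster in (0, 1):
        predicted = [positive if c == positive_cluster else negative for c in cluster_ids]
        pairs = list(zip(predicted, true_labels))
        tp = sum(p == positive and t == positive for p, t in pairs)
        fp = sum(p == positive and t == negative for p, t in pairs)
        fn = sum(p == negative and t == positive for p, t in pairs)
        tn = sum(p == negative and t == negative for p, t in pairs)
        accuracy = (tp + tn) / len(pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        result = {"accuracy": accuracy, "f1": f1, "confusion": [[tn, fp], [fn, tp]]}
        if best is None or accuracy > best["accuracy"]:
            best = result
    return best

# Toy example: 8 indicators, cluster ids as a clustering algorithm might return them.
cluster_ids = [0, 0, 0, 1, 1, 1, 0, 1]
true_labels = ["general", "general", "general", "cultural",
               "cultural", "general", "cultural", "cultural"]
scores = evaluate_two_clusters(cluster_ids, true_labels)
```

Running the same function on the cluster assignments produced from each model's embeddings gives comparable accuracy, F1 and confusion-matrix figures per model.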

Surprisingly, MPNET turns out to be very far from what we expected. This does not mean it is a worse model in general, but for our objective it is not as good as the other two, as it scores lower on every metric. Roberta-All has a slightly better accuracy than Roberta-stsb, but the latter has a considerably higher F1-score as well as a better distributed confusion matrix. With this in mind, we will use the Roberta-stsb model for this project.

Model Exploration

Now that we have chosen a model, we will explore what it has to offer, starting with a 3D visualization of our indicators compressed through a three-component PCA, to see whether there is a visible difference between cultural (red) and general (blue) indicators.

We see there's indeed a reasonable difference between the two types of indicators, even though it is not as strong as it could be.

To understand whether the model is good enough to accomplish what we need, we will try to cluster the three components into different groups so we can have smaller and more consistent chunks of indicators. For that we will use the KMeans algorithm.

KMeans Optimization

As we can see, the best number of clusters for the most efficient indicator representation is 3. In the plot below you can click on the legend entries on the right (type: general or cultural) to hide the type you want to ignore and focus on the distribution of the other.
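The text does not state which criterion was used to pick 3 clusters. A common choice, assumed here, is the "elbow" of the within-cluster sum of squares (inertia) as the number of clusters grows. The sketch below implements a small k-means in NumPy on toy data standing in for the 3-component PCA output. The function name `kmeans_inertia`, the deterministic farthest-point initialisation and the toy blobs are illustrative assumptions, not the notebook's implementation.

```python
import numpy as np

def kmeans_inertia(points, k, n_iter=50):
    """Run Lloyd's algorithm and return the final within-cluster sum of squares."""
    # Deterministic farthest-point initialisation: start from the first point,
    # then repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dists)])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())

# Toy data: three well-separated blobs (NOT the real indicator embeddings).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 3)) for c in centers])

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
```

On this toy data the inertia drops sharply up to k = 3 and only marginally afterwards, which is the shape one looks for when reading an elbow plot.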

Chi-Square (Contingency Table)

The differences between the clusters are not huge, but they do exist. We can see in the plot above that the distribution is far from random: the second cluster (the pink one) is the one that concentrates the cultural indicators, and the chi-square test on the contingency table gives a p-value of 2.10e-32, which rejects the null hypothesis that cluster and indicator type are independent.
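For reference, a chi-square test of independence on a cluster-by-type contingency table can be sketched in plain Python as below. The counts are HYPOTHETICAL: only their totals (1096 indicators, 274 of them cultural) come from the text, so the resulting statistic and p-value will not reproduce the 2.10e-32 reported above. The function name is also an assumption.

```python
import math

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if cluster and type were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# HYPOTHETICAL counts per cluster: [general, cultural].
table = [
    [330, 40],   # blue cluster
    [200, 170],  # pink cluster
    [292, 64],   # yellow cluster
]
statistic, dof = chi_square_independence(table)

# For a 3 x 2 table dof = 2, and with 2 degrees of freedom the chi-square
# survival function is exactly exp(-x / 2), so no statistics library is needed.
p_value = math.exp(-statistic / 2)
```

For tables of other shapes the closed form in the last line no longer applies and a library routine for the chi-square distribution would be needed.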

The separation is not sharper because of the low number of cultural indicators compared to the general ones. As we are using just 3 clusters, we should expect approximately 1096 / 3 indicators in each group (around 365), but we only have 274 cultural indicators and some of them are not that exclusive, so we can expect some "false negatives". Overall, the ideas behind these representations can be summarized as:

  • Blue Cluster: Water, Waste, Electricity, Pollution, Vegetation and Green Spaces (Environmental Cluster - Mostly general indicators)
  • Pink Cluster: Landscape, Archeology, Culture, Identity and Emergency Services (Social Impact Cluster - Mostly cultural indicators, although it contains not only culture but also the human impact on the environment)
  • Yellow Cluster: Economy, Sociology, Accessibility, Transport and Architecture (Economy and Organisation Cluster - Mostly general indicators, although it has a few more cultural indicators than the Environmental cluster)

Overall it seems pretty good, considering that it describes all the data with only three dimensions. We will now reduce the original dimensions of the base encoder again, but less aggressively than before, so that the representation can be more precise for this task.

Fine Tuning (PCA Optimization)

In order to "fine-tune" the model (it is not really fine-tuning, since this is unsupervised learning and we are not trying to improve accuracy or any other metric), we will reduce the dimensionality of the encoder output to a reasonable number larger than 3, so as to prevent both overfitting and underfitting.

In this graph we can see the explained variance as a function of the number of PCA components. It is clear that most of the variance (84%) is explained by fewer than 100 components, but we can reduce this number further without losing much information, as 50 components are enough to explain about 70% of the variance. We will take a closer look to find the optimal number of components.

The most important breakpoints we have when looking at the graph are the following:

  • 44% Explained Variance: 16 columns

  • 58% Explained Variance: 29 columns

  • 72% Explained Variance: 52 columns

Looking at these breakpoints, the best option seems to be to retain 52 columns, which keep 72% of the explained variance, to prevent overfitting.
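The relation between a target explained variance and the number of components to keep can be sketched with NumPy as follows. This is an illustration on random toy data, not the notebook's code: `components_for_variance` is a hypothetical helper, and the toy matrix will not reproduce the 16 / 29 / 52 breakpoints listed above.

```python
import numpy as np

def components_for_variance(embeddings, target):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `target` (a fraction strictly below 1)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Squared singular values of the centred data are proportional to the
    # variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(explained_ratio)
    # First position where the cumulative ratio is >= target, as a count.
    n_components = int(np.searchsorted(cumulative, target)) + 1
    return n_components, cumulative

# Toy data: 200 rows, 20 columns with decreasing spread (NOT the real embeddings).
rng = np.random.default_rng(0)
scales = 1.0 / np.arange(1, 21)
toy = rng.normal(size=(200, 20)) * scales

n_components, cumulative = components_for_variance(toy, 0.72)
```

Applied to the real encoder output, calling the helper with 0.44, 0.58 and 0.72 would be one way to recover breakpoints like the ones listed above.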

Error Analysis

Now that we have created the encoding matrix, we can look at the indicators that are most similar to each other (measured with cosine similarity) and at the ones that do not really match any other, to validate that our model is working properly.

As we can see in the histogram above, most of the indicators (more than 50%) have fairly good matches, taking a similarity above 0.75 as a good match. We will now check the best and worst matches between indicators to see the model's performance in extreme cases.
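A minimal sketch of the closest-match computation described above: for each indicator, find the other indicator with the highest cosine similarity, then measure the share of indicators whose best match clears the 0.75 cut-off. The function name and the toy 2-D vectors are assumptions; the real input would be the 1096-row encoding matrix.

```python
import numpy as np

def closest_matches(embeddings):
    """For each row, return the index of the most similar OTHER row and that similarity."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T             # cosine similarity between all pairs
    np.fill_diagonal(similarity, -np.inf)  # an indicator must not match itself
    best = similarity.argmax(axis=1)
    return best, similarity[np.arange(len(best)), best]

# Toy 2-D "embeddings": rows 0 and 1 point in almost the same direction.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]])
best, scores = closest_matches(toy)

# Fraction of rows whose best match reaches the 0.75 cut-off used in the text.
share_good = float(np.mean(scores >= 0.75))
```

Because cosine similarity is symmetric, mutual nearest neighbours appear as mirrored pairs with identical scores.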

Best Matches

index | indicator | closest_indicator | similarity
0 | Access to public transportation | Access to public transport | 0.998026
1 | Access to public transport | Access to public transportation | 0.998026
2 | Renewable energy | Renewable Energy Production | 0.981198
3 | Renewable Energy Production | Renewable energy | 0.981198
4 | Google hits for the string city name & climate change & heatwave (hits per million inhabitants) | Google hits for the string city name & climate change & urban heat island (hits per million inhabitants) | 0.980196
5 | Google hits for the string city name & climate change & urban heat island (hits per million inhabitants) | Google hits for the string city name & climate change & heatwave (hits per million inhabitants) | 0.980196
6 | Google hits for the string city name & climate change & sea level rise (hits per million inhabitants) | Google hits for the string city name & climate change & flood (hits per million inhabitants) | 0.977662
7 | Google hits for the string city name & climate change & flood (hits per million inhabitants) | Google hits for the string city name & climate change & sea level rise (hits per million inhabitants) | 0.977662
8 | Median disposable annual household income - EUR | Average disposable annual household income – EUR | 0.968831
9 | Average disposable annual household income – EUR | Median disposable annual household income - EUR | 0.968831
10 | Water use | Water consumption | 0.96699
11 | Water consumption | Water use | 0.96699
12 | Public Transport capacity | Capacity of public transport | 0.966592
13 | Capacity of public transport | Public Transport capacity | 0.966592
14 | Proportion of single-parent families; | Proportion of households that are lone-parent households | 0.962271

We can see that most of the top 15 matches are lexical variations (Access to public transport - Access to public transportation), minimal changes in long sentences (heatwave - urban heat island) and, most importantly, recognition of synonyms or context (Water use - Water consumption).

  • Lexical Variations -> Properly detected (Vectorizers + Lemmatizers should work fine too)
  • Small changes in Long Sentences -> Properly detected (Vectorizers + Lemmatizers should work fine too)
  • Synonym Recognition -> Properly detected (Vectorizers + Lemmatizers should NOT work well here, as they are word-based)

In conclusion, for the best matches our model works pretty well, even better than expected in some cases (Consumption - Use).

Worst Matches

index | indicator | closest_indicator | similarity
1081 | Anthropogenic marks and footprints of human influence | Ethnic heterogeneity | 0.462376
1082 | Presence of local government projects and other agents | Number of products of denominated origin | 0.457549
1083 | Pollution of the façade with modern stylistics | Innovative Economic Activities | 0.454691
1084 | Statues (sacral/non sacral) | Percentage of Aboriginal people speaking traditional languages | 0.452585
1085 | Proportion of residents who say that they would like to regularly access traditionally/commonly harvested natural resources and are able to do so as much as needed | Access to parks & recreation areas | 0.449002
1086 | Percentage of citizens archiving electronic health records (e-health cards) | Elderly population | 0.447399
1087 | Patch size of areas of a topographic changes | Land cover change | 0.446469
1088 | Foreclosures | Viewsheds | 0.446332
1089 | Study trails | Creating cultural trails | 0.442418
1090 | Presence of SD strategies | Plastic Arts | 0.433087
1091 | An employee spleen | Proportion of working age population qualified at level 5 or 6 ISCED | 0.432796
1092 | Ancillary facilities | Archeological reserves | 0.430779
1093 | Site micro-climate | Environmental conditions | 0.429144
1094 | Rareness (% of area in the study area) | Number of days ozone O3 concentrations exceed 120 μg/m3 | 0.424135
1095 | Pubic space | Vinehouse, cave | 0.394867

Looking at the worst matches we can see some very good matches (Anthropogenic marks and footprints of human influence - Ethnic heterogeneity) and some bad matches (Site micro-climate - personality and neighbourhood), as well as some cleaning errors from translation (pubic spaces, although the model is able to match it reasonably well). It is not clear whether we should keep these indicators, as they might interfere with the final result (as overfitting). To decide properly we will check 15 more indicators.

index | indicator | closest_indicator | similarity
1066 | Informal settlements | Participation and Partnership | 0.498451
1067 | Adoption of telemedicine (remote diagnosis and treatment) | Availability and penetration of e-learning and distance education systems | 0.496861
1068 | International embeddedness/International integration | Aggregation index of Green Space | 0.494843
1069 | Cemetries | Emergency Operation Centers | 0.49433
1070 | Sopping mall | Wealth Index | 0.490135
1071 | Hotel architecture harmonized with environment | Quality of housing units | 0.49012
1072 | Places of events | Number of protected designated sites | 0.489932
1073 | Consumption | Affordability | 0.48736
1074 | Experience with local natural- non-agricultural and non-timber products | Existence of fruit trees | 0.485886
1075 | Standard of living | Median household income | 0.485205
1076 | AHB | GHG emissions | 0.482776
1077 | Selective collect | Tax burden | 0.473344
1078 | Ethnic, cultural, and gender equality (income, access to opportunities, etc.) | Accessibility for disadvantaged | 0.470033
1079 | Commercial activities linked to Crats, Artisans, Craftsmanship, | Organizations listed on Councils Artist Register | 0.468634
1080 | Patent Applications / Registration for Inhabitant | Toll parking | 0.468112

Again we face the same problem, but this set looks better than the last 15, as it is not completely clear that these indicators are indeed bad matches (excluding Porosity and Critical distances), although there are some that don't find a proper match (Critical distances in the network). Even if some matches are not as good as they should be, the model works well enough that we will not remove indicators below a certain threshold.

Random Matches

index | indicator | closest_indicator | similarity
1014 | Activity rate | Growth rate | 0.552698
849 | Development Outside Cities | New urban centers in the non-urbanized sector in relation to the total growth of the territory | 0.638464
761 | Reduced Parking Footprint | Restricted traffic zones | 0.672505
666 | Traditional and environmental engineering | Environmental Management System | 0.702673
636 | Expenditure for purchase of vehicles | Expenditure for operation of personal transport equipment | 0.711775
631 | Cost of living p.p. year | Household income p. year (mean or median) | 0.714251
568 | Crime Prevention | Safety | 0.731966
545 | Streets with sidewalks | Pedestrian zones | 0.737982
486 | Rate of citizen particip. in public affairs (public hearings) | Community participation in process and type of participation | 0.755057
270 | Health, safety and enivironmental urban rules | Health, safety and enivironmental media programs | 0.823455
107 | Processes: governance of cultural ecosystem services indicator | Structures: governance of cultural ecosystem services indicator | 0.883749
96 | housing diversity (age in place), housing conditions | Housing Diversity | 0.887597
56 | Heritage and Cultural Identity | Historical/cultural heritage | 0.922005
51 | Google hits for the string city name & climate change & drought (hits per million inhabitants) | Google hits for the string city name & climate change & flood (hits per million inhabitants) | 0.930669
44 | Share of blocks served by Green Space > 0.5 ha | Share of population served by Green Space > 0.5 ha | 0.940873

We can see in this random sample that our model works very well regardless of the similarity value: even among the "bad matches" there are some accurate relations between indicators that are not obvious (Activity rate - Growth rate). We will leave it as it is for now.

General and Cultural Indicators Matching

One of our objectives was to match general with cultural indicators, which is what we will do here. Right below there is a sample of these matches:

index indicator_1 indicator_2 indicator_3 indicator_4 indicator_5 similarity_1 similarity_2 similarity_3 similarity_4 similarity_5 type
Amount of land owned by the household Land lot sizes The age of land Landscape value index Acres of preserved land Proximity to residential areas 0,65 0,52 0,48 0,46 0,45 general
Proportion of people who have moved in the last year; No. of visitors and tourists Patch size of areas of a topographic changes Distance from the centre of the old town zone Change in tradition and customs Monitor new knowledge and changes in traditional use patterns 0,43 0,4 0,4 0,38 0,38 general
Area of the park Parkland Areas of importance for landscape-related recreational use or tourism Places of socialisation (parks, benches, etc.) Sense of place landscapes natural and cultural landscapes and components of landscape with importance for natural and cultural heritage 0,88 0,64 0,62 0,52 0,45 general
Research and development expenditure Research centers Development of a public arts fund Funds for building renovation Investments required for restoration of cultural property Funds for the improvement of the physical urban environment 0,44 0,44 0,43 0,41 0,4 general
Penetration level of clean and renewable energy sources Commodifcation Traditional and environmental engineering Arable field indicator Cleanliness and Improvements Degree of unobstructed view 0,33 0,32 0,31 0,29 0,28 general
Children and young people from 4 to 17years old at school Educational Promotion and education in Past Identities programmes and spaces Workspaces of Artisans, Intellectuals & youth Number of cultural–educational programs. Scientific/educational landscapes 0,69 0,52 0,48 0,39 0,39 general
Urban sprawl Number of cultural sites affected by natural disasters and urban expansion Urban green space Porosity Attractive streetscape Enhancement urban green 0,47 0,45 0,42 0,37 0,35 general
Applications for mobile devices Number of citizens initiatives Social preferences No. of visitors and tourists Patch size of areas of a topographic changes Recreation (motorized/non-motorized) 0,43 0,4 0,37 0,34 0,32 general
City's Employment / UNEmployment Rate, Measures to Combat Unployment Effect on local business and jobs Parking places in building surroundings Organizations listed on Councils Artist Register Pressure of parking Type and amount of training given to tourism employees 0,51 0,39 0,35 0,34 0,33 general
Safe streets Safeguarding the terraced landscape Percent satisfied with cultural integrity/sense of security Comfortable use Plan that does not spoil natural and historical environment Well-preserved village 0,58 0,56 0,46 0,4 0,39 general
Regional Materials Barrow (traditional elements) Archeological reserves Historic relic indicator Rareness (% of area in the study area) Memorial stones 0,5 0,44 0,43 0,4 0,39 general
Government transfer payments Effect on Property Value Values Community arts funding Revenues from tourism (% farm income) Availability of cultural site maintenance fund and resource 0,51 0,42 0,37 0,34 0,32 general
Presence of major international and domestic enterprises and entities Effect on local business and jobs Listed built elements Number of man-made structures with a function Number of products of denominated origin Workspaces of Artisans, Intellectuals & youth 0,42 0,41 0,39 0,38 0,37 general
Churches Safe cultural and religious sites Primary school Feeling of belonging Noise Childcare facilities 0,47 0,45 0,44 0,4 0,4 cultural
Pressure of parking Parking fees Toll parking Parking facilities Restricted traffic zones Public car parking availability 0,85 0,77 0,76 0,72 0,69 cultural
Encoding matrix (first and last rows):
0 1 2 3 4 5 6 7 8 9 ... 42 43 44 45 46 47 48 49 50 51
indicator
Harmony with the surroundings 7.282057 -2.454299 0.151753 -2.283128 2.003885 0.683637 -1.120911 -0.287806 1.056961 4.078291 ... 2.520448 -0.831086 -0.847964 3.361331 -1.868152 -1.172748 0.371100 -0.916223 1.565204 1.786492
Amenities -2.929569 -2.631266 1.350796 3.430605 -2.249266 -6.104327 0.009646 -2.862606 1.771063 7.168769 ... 0.499023 0.446307 -0.197878 0.533697 0.713515 -2.697555 2.059854 1.263676 0.211504 0.375779
Cycling facilities 0.283958 -1.673766 -3.109287 0.001445 -0.427341 -3.065259 -4.241240 0.033688 0.203370 0.459429 ... -0.630681 -1.505569 -1.203669 1.643100 -4.772877 0.518571 -1.652187 -2.894972 0.051957 4.811692
Walking facilities -1.817154 -3.651871 -0.681428 2.151489 3.051410 -6.931185 -2.693790 3.145501 -7.966206 -1.727127 ... -1.211080 -0.540630 -0.818467 -4.836381 -3.111324 -0.439630 -0.781686 3.497477 -1.884980 -1.406166
Education facilities 2.011796 -8.004374 -0.485804 5.129246 -1.990282 -7.747918 1.711529 -2.173116 3.865458 2.762686 ... -0.355378 0.431873 -0.504031 -1.713107 -2.118815 0.497597 -0.202879 -1.154281 0.491547 0.755116
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Experience with local natural- non-agricultural and non-timber products 0.629891 3.188698 9.487859 -5.409531 -1.355939 2.023808 2.356576 -1.524996 2.981369 -0.988428 ... 1.362939 2.521367 1.970316 6.177068 -2.211789 -2.123587 1.038951 -1.674013 -1.286845 -2.301522
Existence of fruit trees 1.770916 2.930512 9.374191 -0.954374 -0.187099 -3.182366 5.131309 -0.250791 6.320688 -5.173204 ... 3.306871 -1.928208 -0.889549 1.248320 3.154093 -0.866540 3.149088 -0.244128 1.911147 0.299484
Anthropogenic marks and footprints of human influence 0.571731 -6.737236 6.655488 0.004466 -1.787687 -0.494542 1.569560 -2.861990 0.945075 -3.760969 ... 0.208821 0.255918 -0.985719 2.438982 1.099828 -0.132585 -1.993902 0.684104 -4.518700 1.109912
Composite tree risk index 2.781933 5.443534 3.919466 0.889315 4.890152 -1.232697 6.145800 6.295201 4.344382 -5.184671 ... 1.654241 0.304901 -0.524108 1.805785 4.905205 2.792621 -0.753775 -0.579769 0.142503 -2.480524
Garden 2.567378 3.250693 2.985569 -0.557150 0.370176 -6.886905 -1.756595 6.515772 -1.228029 -3.281694 ... 1.369079 -3.142811 -1.263017 -1.363452 -0.761920 -0.256860 1.321203 1.962725 1.278469 1.742588

1096 rows × 52 columns
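The general-to-cultural matching described in this section can be sketched on top of an encoding matrix like the one above: for every indicator of one type, rank the indicators of the other type by cosine similarity and keep the top k. The function name, the `types` list and the toy vectors are illustrative assumptions, not the notebook's code.

```python
import numpy as np

def top_k_cross_matches(embeddings, types, source_type, k=5):
    """For every indicator of `source_type`, list the k most similar
    indicators of the OTHER type as (row index, cosine similarity) pairs."""
    types = np.asarray(types)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    source_idx = np.flatnonzero(types == source_type)
    target_idx = np.flatnonzero(types != source_type)
    similarity = unit[source_idx] @ unit[target_idx].T
    # Sort each row by descending similarity and keep the first k columns.
    order = np.argsort(-similarity, axis=1)[:, :k]
    return {
        int(s): [(int(target_idx[j]), float(similarity[row, j])) for j in order[row]]
        for row, s in enumerate(source_idx)
    }

# Toy example: 2 general and 3 cultural indicators in a 2-D space.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
types = ["general", "general", "cultural", "cultural", "cultural"]
matches = top_k_cross_matches(toy, types, "general", k=2)
```

Calling the function a second time with "cultural" as the source type gives the matches in the opposite direction, which corresponds to the rows labelled cultural in the sample table.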

Indicators Match Visualizer

In the 3D visualization below we can see the top 5 matches for the selected indicator and their positions in the three-component PCA. Note that the indicator filter will not be available in the HTML file, as it cannot be embedded there; it will only be available in the notebook version.